Recovering Semantics of Tables on the Web

Authors

Petros Venetis

Alon Y. Halevy

Jayant Madhavan

Marius Pasca

Warren Shen

Fei Wu

Gengxin Miao

Chung Wu

Abstract

The Web offers a corpus of over 100 million tables [6], but the meaning of each table is rarely explicit from the table itself. Header rows exist in few cases and even when they do, the attribute names are typically useless. We describe a system that attempts to recover the semantics of tables by enriching the table with additional annotations. Our annotations facilitate operations such as searching for tables and finding related tables. To recover semantics of tables, we leverage a database of class labels and relationships automatically extracted from the Web. The database of classes and relationships has very wide coverage, but is also noisy. We attach a class label to a column if a sufficient number of the values in the column are identified with that label in the database of class labels, and analogously for binary relationships. We describe a formal model for reasoning about when we have seen sufficient evidence for a label, and show that it performs substantially better than a simple majority scheme. We describe a set of experiments that illustrate the utility of the recovered semantics for table search and show that it performs substantially better than previous approaches. In addition, we characterize what fraction of tables on the Web can be annotated using our approach.
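As a rough illustration of the labeling idea described above, the following minimal sketch attaches a class label to a column when more than a fixed fraction of its values carry that label in a class-label database. This is the simple majority-style baseline that the abstract says the paper's formal model outperforms, not that model itself. The names `label_column`, `TOY_ISA_DB`, and `min_fraction`, as well as the toy data, are assumptions made for illustration and do not come from the paper.

```python
from collections import Counter

# Hypothetical toy class-label database: value -> labels it was seen with.
# A real database extracted from the Web would be far larger and noisier.
TOY_ISA_DB = {
    "paris": ["city", "capital"],
    "london": ["city", "capital"],
    "chicago": ["city"],
    "amazon": ["company", "river"],
}


def label_column(values, isa_db, min_fraction=0.5):
    """Return (label, support) pairs for labels carried by more than
    min_fraction of the column's values, strongest support first."""
    if not values:
        return []
    counts = Counter()
    for value in values:
        # Normalize the cell text; each cell votes at most once per label.
        for label in set(isa_db.get(value.strip().lower(), ())):
            counts[label] += 1
    total = len(values)
    supported = [
        (label, count / total)
        for label, count in counts.items()
        if count / total > min_fraction
    ]
    # Sort by descending support, then alphabetically for stable output.
    return sorted(supported, key=lambda pair: (-pair[1], pair[0]))


column = ["Paris", "London", "Chicago", "Amazon"]
print(label_column(column, TOY_ISA_DB))
```

In this toy column, "city" is supported by three of four values and is kept, while "capital" is supported by exactly half and falls below the strict threshold. A fixed threshold like this treats all evidence equally, which is the kind of limitation a formal model of sufficient evidence is meant to address.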


Similar resources

Hierarchical Fuzzy Clustering Semantics (HFCS) in Web Document for Discovering Latent Semantics

This paper discusses the future of World Wide Web development, called the Semantic Web. Undoubtedly, Web service is one of the most important services on the Internet, which has had the greatest impact on the generalization of the Internet in human societies. Internet penetration has been an effective factor in growth of the volume of information on the Web. The massive growth of informat...


Applying WebTables in Practice

We started investigating the collection of HTML tables on the Web and developed the WebTables system a few years ago [4]. Since then, our work has been motivated by applying WebTables in a broad set of applications at Google, resulting in several product launches. In this paper, we describe the challenges faced, lessons learned, and new insights that we gained from our efforts. The main challen...


Synthesizing Union Tables from the Web

Several recent works have focused on harvesting HTML tables from the Web and recovering their semantics [Cafarella et al., 2008a; Elmeleegy et al., 2009; Limaye et al., 2010; Venetis et al., 2011]. As a result, hundreds of millions of high quality structured data tables can now be explored by the users. In this paper, we argue that those efforts only scratch the surface of the true value of str...


TabEL: Entity Linking in Web Tables

Web tables form a valuable source of relational data. The Web contains an estimated 154 million HTML tables of relational data, with Wikipedia alone containing 1.6 million high-quality tables. Extracting the semantics of Web tables to produce machine-understandable knowledge has become an active area of research. A key step in extracting the semantics of Web content is entity linking (EL): the ...


A Large Public Corpus of Web Tables containing Time and Context Metadata

The Web contains vast amounts of HTML tables. Most of these tables are used for layout purposes, but a small subset of the tables is relational, meaning that they contain structured data describing a set of entities [2]. As these relational Web tables cover a very wide range of different topics, there is a growing body of research investigating the utility of Web table data for completing cross...



Journal title:

PVLDB

Volume 4

Publication date: 2011
